FIGURE 5.8
Outliers over X̃, γ, and X of LayerNorm on BERT-SST-2. For example, at dimension 308, both γ and X̃ have sharper values; once γ is excluded, X exhibits a milder distribution than X̃.

FIGURE 5.9
The value (mean + 3 * std) is drawn as the left border, and candidate values for clipping the tensor are then enumerated on RoBERTa-QNLI. The plot also reflects the proportion of clipped tokens.

5.6.2 Gamma Migration

Specifically, gamma migration produces a more quantization-friendly model by migrating the outlier amplifier γ into subsequent modules via an equivalent transformation, yielding activations that are more robust to quantization without any extra computation burden. As shown in Fig. 5.10, γ is excluded from the LayerNorm and moved to the shortcut branch and the weight of the next layer. As a result, the LayerNorm becomes the Non-scaling LayerNorm, while the shortcut branch and the weight of the next layer absorb the new parameter γ.
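
To make the transformation concrete, the sketch below folds the γ of a standard nn.LayerNorm into the following linear layer and returns it as a separate factor for the shortcut branch. It is a minimal PyTorch sketch, not the chapter's actual implementation: the names NonScalingLayerNorm and migrate_gamma are illustrative, and it assumes the LayerNorm output feeds one linear layer plus one shortcut branch and that γ contains no zeros.

```python
import torch
import torch.nn as nn

class NonScalingLayerNorm(nn.Module):
    """LayerNorm with the scale γ removed; the bias becomes β/γ, so multiplying
    the output by γ afterwards reproduces the original LayerNorm exactly."""
    def __init__(self, ln: nn.LayerNorm):
        super().__init__()
        self.eps = ln.eps
        # New bias β' = β / γ (assumes γ has no zero entries).
        self.bias = nn.Parameter(ln.bias.detach() / ln.weight.detach())

    def forward(self, x):
        mu = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return (x - mu) / torch.sqrt(var + self.eps) + self.bias

def migrate_gamma(ln: nn.LayerNorm, next_linear: nn.Linear):
    """Fold γ into next_linear's weight (in place) and return the Non-scaling
    LayerNorm together with the γ kept for the shortcut branch."""
    gamma = ln.weight.detach()
    with torch.no_grad():
        # (γ ⊙ x) @ W.T == x @ (W * γ).T, so γ is absorbed along in_features.
        next_linear.weight.mul_(gamma)
    return NonScalingLayerNorm(ln), gamma.clone()
```

In full precision the block stays numerically equivalent: the Non-scaling LayerNorm output, scaled by γ on either branch, equals the original LayerNorm output; only the tensor that gets quantized changes.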

In Fig. 5.10, the “Quant” process quantizes X. The quantized output then feeds two branches: the first performs the matrix multiplication on the bottom branch; the second multiplies by the parameter γ and goes through the “DeQuant” process. In effect, the γ computation is simply delayed from the LayerNorm to the shortcut branch, so the new design does not increase the computation overhead.
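
A minimal sketch of this flow is given below, assuming simple symmetric per-tensor fake quantization; the quantize helper, block_forward, and the scale arguments are illustrative assumptions rather than the chapter's actual kernels.

```python
import torch

def quantize(x, scale, num_bits=8):
    """Symmetric fake quantization to signed integers (illustrative helper)."""
    qmax = 2 ** (num_bits - 1) - 1
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax)

def block_forward(x, w_q, w_scale, x_scale, gamma):
    """X is quantized once; the quantized tensor feeds both branches."""
    x_q = quantize(x, x_scale)          # "Quant" on the Non-scaling LayerNorm output X
    # Bottom branch: low-bit matrix multiplication with the quantized weight
    # (which already holds the folded-in γ), then dequantize with combined scales.
    bottom = (x_q @ w_q.t()) * (x_scale * w_scale)
    # Shortcut branch: multiply by γ, then "DeQuant" back to floating point;
    # γ is applied here instead of inside the LayerNorm.
    shortcut = (x_q * gamma) * x_scale
    return bottom, shortcut
```

The element-wise multiplication by γ on the shortcut branch replaces the one the original LayerNorm would have performed, which is why the migration adds no extra operations.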

FIGURE 5.10
Left: the quantization flow before gamma migration. Right: the flow with gamma migration.